[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet #14067
HyukjinKwon wants to merge 4 commits into apache:master from HyukjinKwon/SPARK-16371
Conversation
Hi, @rxin @liancheng, I hope this does not miss 2.0.

cc @viirya as well.

LGTM pending Jenkins. 2.0.0 RC2 has already been cut. We may have this in 2.0.0 if there is another RC.

Oh, @liancheng, I just corrected some more. Please take another look (sorry).
!f.metadata.contains(StructType.metadataKeyForOptionalField) ||
!f.metadata.getBoolean(StructType.metadataKeyForOptionalField)
}.map(f => f.name -> f.dataType) ++ fields.flatMap { f => getFieldMap(f.dataType) }
}.map(f => f.name -> f.dataType)
Could you please add some comment here?
The description seems incorrect. It should be a StringType instead of IntegerType?

Yes, it is not; I just found that. I will correct them all. Thank you!

Another question: when the inner field does not have the same name, we can still push down the filter, right?

If so, then this patch seems to completely skip all such push-downs.
Test build #61841 has finished for PR 14067 at commit

Test build #61840 has finished for PR 14067 at commit

Test build #61842 has finished for PR 14067 at commit

Test build #61843 has finished for PR 14067 at commit
Yeah, currently Spark SQL doesn't support column pruning and/or filter push-down for nested fields.

OK. LGTM then.
}
}

test("Do not push down filters incorrectly when inner name and outer name are the same") {
I'm going to merge this and fix some comments myself with another PR. Merging into master/2.0.
[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet
## What changes were proposed in this pull request?
Currently, if there is a schema as below:
```
root
|-- _1: struct (nullable = true)
| |-- _1: integer (nullable = true)
```
and if we execute the codes below:
```scala
df.filter("_1 IS NOT NULL").count()
```
This pushes down a filter even though the filter is being applied to a `StructType`. (If my understanding is correct, Spark does not push down filters for those.)
The reason is that `ParquetFilters.getFieldMap` produces the results below:
```
(_1,StructType(StructField(_1,IntegerType,true)))
(_1,IntegerType)
```
and then it becomes a `Map`:
```
(_1,IntegerType)
```
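The collapse above is just ordinary `Map` semantics: when a sequence of key/value pairs with duplicate keys is turned into a `Map`, the last entry for a key wins. A standalone sketch (plain Scala, not Spark code; the string values merely stand in for Catalyst data types):

```scala
object MapCollisionSketch extends App {
  // Pairs in the order getFieldMap produced them: the outer field first,
  // then the nested field that happens to share the name "_1".
  val pairs = Seq(
    "_1" -> "StructType(StructField(_1,IntegerType,true))", // outer field
    "_1" -> "IntegerType"                                   // nested field
  )

  // toMap keeps the last value for a duplicate key, so the outer
  // StructType entry is silently overwritten by the nested IntegerType.
  val fieldMap = pairs.toMap
  println(fieldMap("_1")) // prints IntegerType
}
```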
Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1`, which Parquet thinks is `IntegerType`. However, it is actually a `StructType`.
So, Parquet's filter2 API produces incorrect results; for example, the code below:
```scala
df.filter("_1 IS NOT NULL").count()
```
always produces 0.
This PR prevents this by no longer collecting nested fields in `getFieldMap`.
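A minimal sketch of the shape of the change (plain Scala with tiny stand-in types, not the actual Catalyst classes; the real change lives in `ParquetFilters.getFieldMap` and also involves the optional-field metadata filter shown in the diff): before the fix the helper recursed into nested structs, after the fix it maps only top-level fields.

```scala
object FieldMapSketch extends App {
  // Illustrative stand-ins for Catalyst's DataType hierarchy.
  sealed trait DataType
  case object IntegerType extends DataType
  final case class StructField(name: String, dataType: DataType)
  final case class StructType(fields: Seq[StructField]) extends DataType

  // Before the fix: nested fields are collected too, so an inner "_1"
  // shadows the outer "_1" once the pairs are turned into a Map.
  def getFieldMapBefore(dt: DataType): Seq[(String, DataType)] = dt match {
    case StructType(fields) =>
      fields.map(f => f.name -> f.dataType) ++
        fields.flatMap(f => getFieldMapBefore(f.dataType))
    case _ => Seq.empty
  }

  // After the fix: only top-level fields are mapped.
  def getFieldMapAfter(dt: DataType): Seq[(String, DataType)] = dt match {
    case StructType(fields) => fields.map(f => f.name -> f.dataType)
    case _ => Seq.empty
  }

  // root |-- _1: struct |-- _1: integer
  val schema = StructType(Seq(
    StructField("_1", StructType(Seq(StructField("_1", IntegerType))))))

  println(getFieldMapBefore(schema).toMap) // "_1" resolves to IntegerType (wrong)
  println(getFieldMapAfter(schema).toMap)  // "_1" resolves to the StructType (correct)
}
```

With the recursion removed, a filter on the outer `_1` sees a `StructType` in the field map and is therefore not handed to Parquet's filter2 push-down at all.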
## How was this patch tested?
Unit test in `ParquetFilterSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14067 from HyukjinKwon/SPARK-16371.
(cherry picked from commit 4f8ceed)
Signed-off-by: Reynold Xin <rxin@databricks.com>
